Because speech perception is vitally important, the visual system may possess exceptional sensitivity to speech-related signals conveyed by lip movements, even without awareness of the speaking face. We tested this possibility using continuous flash suppression (CFS; e.g., Tsuchiya & Koch, 2005), a technique in which a critical stimulus, here a speaking face, is presented to one eye while strong dynamic-masking noise is presented to the other eye (Figure 1A), rendering the speaking face invisible. We determined whether the visual system could still encode the spatiotemporal patterns of lip movements when observers were aware of only the randomly flashing masking display.
A. Trial sequence of a masked-face (face invisible) trial. Each trial began with a fixation cross (0.36° × 0.36°) lasting 3000 ms. While the face was presented to one eye, a strong dynamic mask, consisting of a random array of ...
We measured the encoding of invisible lip movements as crossmodal facilitation of spoken word categorization (e.g., Sumby & Pollack, 1954). Participants determined whether each spoken word was a target word (a tool name) or a non-target word (a name of a non-tool object) while they concurrently viewed a face that either spoke the same word—the congruent condition—or a different word—the incongruent condition.
Prior research suggests that spatial attention influences unaware as well as aware visual processing (e.g., Cohen et al., 2012), and that attention to the mouth region is necessary for lip movements to facilitate spoken word perception (e.g., Alsius et al., 2005; Driver & Spence, 1994). To help direct attention to the mouth region, on half of the trials, we presented the face without the dynamic mask (Figure 1B), and we instructed participants to localize a small probe briefly presented near the mouth (Supplementary Figure S1). These “attention-enforcement” trials were randomly intermixed with the critical masked-face (face invisible) trials. To further enforce attention to the mouth region on the masked-face trials, the probe also appeared on the masked face, and participants were instructed to report its location whenever the face became visible through the mask. Participants reported seeing the face on 5% of the masked-face trials, and the data from those trials were removed from the analyses. If the visual system automatically extracts lip movements even when they are invisible, spoken-word categorization should be facilitated by congruent lip movements even when the face is invisible on the masked-face trials.
Indeed, responses to the spoken target words on the masked-face trials were significantly faster when the lip movements were congruent than incongruent, t(45) = 2.66, p < 0.05 (ts > 3.50, ps < 0.005).
To control for the possibility that participants might have failed to report face visibility on some of the masked-face trials, we performed a second experiment incorporating a more stringent indicator of face visibility. A tinted translucent ellipse was placed over the mouth region of the face. On each trial, after responding to the spoken word, participants were asked to report the color of the ellipse (red, blue, green, or yellow); critically, they were required to guess if they thought that they had not seen the face. All masked-face trials on which participants correctly reported the color (30%) were removed from analysis.
The same pattern of results was obtained. On the masked-face (face invisible) trials, responses to the spoken target words were significantly faster when the lip movements were congruent than incongruent, t(23) = 2.12, p < 0.05 (Figure 1D), with a mean accuracy of 92% and no evidence of a speed-accuracy trade-off. Also consistent with the original experiment, there was no congruency effect for target responses on the attention-enforcement trials, t(23) = 0.80, n.s. (Figure 1F), or for non-target responses (1777 ms [congruent] vs. 1743 ms [incongruent], t(23) = 1.06, n.s., on the masked-face trials, and 1708 ms [congruent] vs. 1779 ms [incongruent], t(23) = 1.97, n.s., on the attention-enforcement trials).
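The congruency effects reported above are paired comparisons of per-participant mean reaction times. As an illustrative sketch only (using simulated reaction times, not the study's data), the paired t statistic underlying these comparisons can be computed as follows:

```python
import math
import random

def paired_t(cong, incong):
    """Paired t-test on per-participant mean RTs (congruent vs. incongruent).

    Returns the t statistic and degrees of freedom (n - 1).
    """
    diffs = [i - c for c, i in zip(cong, incong)]  # positive = congruent faster
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n), n - 1

# Simulated RTs in ms for 24 hypothetical participants; these values are
# illustrative and are not the data reported in the text.
random.seed(1)
congruent = [random.gauss(1500, 120) for _ in range(24)]
incongruent = [c + random.gauss(60, 80) for c in congruent]  # ~60 ms slowing

t, df = paired_t(congruent, incongruent)
print(f"t({df}) = {t:.2f}")
```

Because each participant contributes one congruent and one incongruent mean, the test is computed on the within-participant differences, giving n - 1 degrees of freedom (here 23 for 24 participants, matching the t(23) tests above).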
These results demonstrate that even when a speaking face is rendered invisible by a dynamic mask carrying strong motion signals, the visual system accurately encodes the invisible lip movements and thereby facilitates auditory perception of the corresponding spoken words. This crossmodal effect likely occurs at the level of word encoding: invisible lip movements have been shown not to generate a McGurk effect (Palmer & Ramsey, 2012), suggesting that they do not influence auditory perception at the level of syllable encoding.
Dorsal motion-processing mechanisms (e.g., V3a, V5) would have responded predominantly to the strong, visible flashing mask (e.g., Moutoussis et al., 2005). The invisible lip movements would thus likely have been processed through the ventral visual pathway, including the superior temporal sulcus (STS), an area that selectively responds to biological motion and movements of facial features (e.g., Allison et al., 2000; Calvert & Campbell, 2003; Grossman et al., 2000), and would have facilitated spoken word perception via multimodal portions of the STS (e.g., Calvert et al., 2000). Sophisticated unconscious processing of static images (e.g., words, faces, the sex of human bodies, and contextual congruence; Jiang et al., 2006; Jiang et al., 2007; Mudrik et al., 2011; Yang et al., 2007) has been demonstrated. Our results extend these prior findings to the processing of dynamic information. Static information can, in principle, be extracted from a dynamic mask by temporal averaging. However, unconscious extraction of the subtle dynamics of lip movements from the overwhelming random dynamics of the mask requires sophisticated tuning of the ventral visual system to behaviorally relevant dynamics.